Regret Minimization in Nonstationary Markov Decision Processes

Authors

  • Jia Yuan Yu
  • Shie Mannor
Abstract

We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion to some extent. We propose online learning algorithms and provide guarantees on their performance evaluated in retrospect against stationary policies. Unlike previous works, the guarantees depend critically on the variability of the uncertainty in the transition probabilities, but hold regardless of arbitrary changes in rewards and transition probabilities. First, we use an approach based on robust dynamic programming and extend it to the case where reward observation is limited to the actual state-action trajectory. Next, we present a computationally efficient simulation-based Q-learning style algorithm that requires neither prior knowledge nor estimation of the transition probabilities. We show both probabilistic performance guarantees and deterministic guarantees on the expected performance.
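
The abstract only sketches the proposed algorithms, so the following is a minimal, hypothetical illustration (in Python) of a simulation-based, Q-learning style update in the spirit of the second algorithm described: it is driven purely by the observed state-action-reward trajectory and neither knows nor estimates the transition probabilities. The state and action counts, discount factor, exploration rate, and step-size schedule below are illustrative assumptions, not values taken from the paper.

    import numpy as np

    # Hypothetical problem sizes and parameters, for illustration only.
    n_states, n_actions = 5, 3
    gamma = 0.95                      # discount factor (assumed)
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    def step_size(t):
        # Decaying step size; the schedule analyzed in the paper may differ.
        return 1.0 / (1.0 + t)

    def epsilon_greedy(state, eps=0.1):
        # Explore with probability eps, otherwise act greedily on the current Q.
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    def q_update(t, s, a, r, s_next):
        # One-step Q-learning update computed from the observed transition
        # (s, a, r, s_next) alone; no transition model is built or estimated.
        Q[s, a] += step_size(t) * (r + gamma * np.max(Q[s_next]) - Q[s, a])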

Similar articles

Better Rates for Any Adversarial Deterministic MDP

We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state-of-the-art forward in two ways: First, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any...

Inference-based Decision Making in Games

Background: Reinforcement learning in complex games has traditionally been the domain of value or policy iteration algorithms, resulting from their effectiveness in planning in Markov decision processes, before algorithms based on regret minimization guarantees such as upper confidence bounds applied to trees (UCT) and counterfactual regret minimization were developed and proved to be very succe...

Stochastic Regret Minimization for Revenue Management Problems with Nonstationary Demands

Huanan Zhang and Cong Shi (Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109); Chao Qin (Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208); Cheng Hua (Yale School of Management, Yale University, New Haven, CT 06511).

Reinforcement Learning Algorithms for Regret Minimization in Structured Markov Decision Processes: (Extended Abstract)

A recent goal in the Reinforcement Learning (RL) framework is to choose a sequence of actions or a policy to maximize the reward collected or minimize the regret incurred in a finite time horizon. For several RL problems in operations research and optimal control, the optimal policy of the underlying Markov Decision Process (MDP) is characterized by a known structure. The current state of the ar...

Journal:

Publication date: 2010